Abstract

A possible mathematical description of partial knowledge, ability, and 
willingness to answer in multiple-choice tests is as follows. Each alternative
in each item is envisaged to generate within the subject a certain feeling of 
mismatch to the question asked. The strength of this tends to be greater for 
incorrect alternatives than for correct answers, though there is significant 
random variation. The subject chooses the alternative which has given rise to
the lowest mismatch, except that if this minimum mismatch is larger than some 
threshold, the question is left unanswered. (Of course, the threshold would 
probably be influenced by the instructions concerning guessing.) Assuming some 
statistical distribution of mismatch, we may obtain the proportions of items 
answered correctly and answered wrongly, in terms of ability and willingness 
to guess. 

Many different forms of multiple-choice test have from time to time been used, 
either for practical or theoretical reasons. Among them have been ones in 
which: confidence ratings are required; answers are required to items initially 
omitted; a second choice is permitted when the first answer is wrong; the task 
is to identify as many as possible of the wrong alternatives; some items are 
repeated (usually disguised); for some items, no correct answer is among the 
alternatives available. 

The approach outlined in the first paragraph helps us understand performance in 
such tests. Two datasets have received detailed attention. One was an
answer-until-correct test of spatial reasoning (386 examinees, 30 items,
5 alternatives). Evidence for the operation of partial knowledge was given by
two findings: (i) performance when second and subsequent choices are made
(after the first choice is wrong) is above the chance level, and (ii) it is
positively related to first-choice performance. The second dataset was a
4-alternative test of chemistry administered to 407 subjects. There were 20
genuine items plus 4 nonsense items. The proportions of the genuine items
answered correctly, answered incorrectly, and omitted can be used to predict
the proportion of the nonsense items attempted. Fairly good agreement between
predicted and observed proportions was found.







1. Introduction

Lumsden (1976) began his review of test theory with the comment 
"there has been a general atmosphere of melancholia and lassitude among 
latter-day test theorists ... The shreds of theory that have been developed 
and the time-worn true score models are not rich sources of ideas about 
testing". And he concluded "The picture revealed is grim ... Little of any 
consequence has been achieved ... It is only slightly unfair to say that 
test theory has failed as theory ... I have supped my fill of horrors". 

Hutchinson (1982) adapted some ideas from signal detection theory to 
attempt some advance. He supposed that choosing the correct answer from 
several alternatives was akin to choosing which of several intervals contained 
a signal. That is, each alternative gives rise within the subject to a feeling 
of inappropriateness - inappropriateness, that is, to the question asked. This 
mismatch between question and answer tends to be higher for the incorrect 
alternatives than for the correct one, but there is substantial random 
variation and a significant degree of overlap. Note the analogy with signal 
detection theory, in which both noise and signal-plus-noise have statistical 
distributions (which overlap) of their internal representation. 

Hutchinson's theory is in many ways crude - there is no attempt at 
mechanistic realism in modelling the subject's processes of thought in 
attacking a problem, but at least it is able to predict how the subject will
behave when the format of the test is changed - when a second attempt is
permitted at questions answered wrongly, for instance. At least the theory
is not as bad as assuming the only alternative to perfect knowledge is 
no knowledge and hence random guessing. 

Some empirical evidence was adduced by Hutchinson (1982) in support of 
his viewpoint: 

- When subjects assign a confidence rating to their answer, the higher the 
level of confidence, the greater is the probability of being correct. 

- When an answer is required to items initially left unanswered, a 
higher-than-chance proportion are found to be correct. 

- Scores calculated with the conventional guessing correction (based upon 
all-or-none knowledge) are higher under "Attempt every item" directions than 
under "Omit the item if your answer would be a guess" directions. 

- When subjects have a second attempt at items answered wrongly, a 
higher-than-chance proportion are found to be correct. 

I have now reanalysed two datasets kindly made available to me: one 
of responses in an answer-until-correct test, and one of a conventional test 
with which were included some nonsense items. Brief accounts of the results 
obtained are given below; further details are given in Hutchinson (1985) for
the former, and in Frary and Hutchinson (1982) and Hutchinson (1984) for the
latter.



2. A description of partial knowledge analogous to the signal detection
model of perception

2.1 Qualified dismissal of models that ...

Descriptions of subjects' reactions to some types of item may possess a
degree of mechanistic realism. For instance, a subject may know something 
about a particular alternative answer that eliminates it from consideration. 
(Asked to indicate whether Paris or Rome is the capital of France, the 
correct answer is given if the subject knows that Rome is the capital of 
Italy.) As a second example, the product of 2½ and 3¼ may be known to lie
between 7½ and 10, thus eliminating alternatives such as 5¼ and 11½,
without the full details of multiplying fractions being known. I do not 
quite rule out the feasibility of tailoring theories of partial knowledge 
to fit specific types of item. But since most tests contain items of many 
types, and since what is usually wanted is a single score representing some 
form of general ability, I think a theory of general applicability is to be 
preferred, even if by its abstractness it loses mechanistic realism. 

2.2 Distributions of mismatch

We shall adapt our ideas from signal detection theory (Green and Swets, 1966).
To explain errors when a subject is attempting to detect a faint stimulus, 
this supposes the subject responds according to whether the level of some 
internal sensation exceeds or falls below a threshold level; and that the 
sensation is variable (i.e. has some statistical distribution), both when 
the stimulus is presented and when it is not, the average levels being 
different in the two conditions. Similarly, we shall suppose that each 
alternative in each item generates within the subject a certain feeling of 
inappropriateness to the question posed. This feeling tends to be stronger
for the incorrect alternatives than for the correct one, though there is 
appreciable random variation. The subject normally chooses the alternative 
that generated the lowest mismatch. But if all exceeded some threshold level 
then the subject is unwilling to answer. (This threshold is naturally affected 
by the instructions given concerning guessing.) 
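
The decision rule just described can be put as a minimal sketch. The function
name `respond` and the convention of returning `None` for an omitted item are
illustrative assumptions, not part of the theory itself.

```python
from typing import Optional, Sequence


def respond(mismatches: Sequence[float], threshold: float) -> Optional[int]:
    """Return the index of the chosen alternative, or None if the item is omitted.

    mismatches[i] is the feeling of inappropriateness generated by alternative i.
    """
    # The subject looks for the alternative with the lowest mismatch ...
    best = min(range(len(mismatches)), key=lambda i: mismatches[i])
    # ... but declines to answer if even that one exceeds the threshold.
    if mismatches[best] > threshold:
        return None
    return best


answered = respond([0.9, 0.3, 0.7], threshold=0.5)  # alternative 1 is chosen
omitted = respond([0.9, 0.6, 0.7], threshold=0.5)   # every mismatch is too high
```

Raising the threshold corresponds to instructions that encourage guessing;
lowering it corresponds to instructions that discourage it.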






2.3 Mathematical expression 
Notation: 

N = number of alternatives in each item,
c = proportion of items answered correctly,
w = proportion of items answered wrongly,
u = proportion of items not answered.

X represents the inappropriateness of an alternative. The greater the
difference between its average levels for correct and for incorrect
alternatives, the easier is the item (or the cleverer is the subject).
Denote the distributions of X under the two conditions by F and G:
Probability of X exceeding the value x for correct alternatives = F(x),
Probability of X exceeding x for incorrect alternatives = G(x),
F(x) being less than G(x).

T represents the response threshold, such that if the inappropriateness
level exceeds this for all the alternative answers to an item, no choice
is made.



We can now write down equations for u, c, and w in terms of F and G: 

    u = F(T) \, [G(T)]^{N-1}

    c = \int_{-\infty}^{T} \left( -\frac{dF(x)}{dx} \right) [G(x)]^{N-1} \, dx

    w = 1 - u - c

What the second of these equations is saying is that the probability
of the inappropriateness generated by the correct alternative taking a
value x is the probability density corresponding to F(x), -dF(x)/dx;
that the probability of all the N-1 incorrect alternatives having higher
levels of inappropriateness is [G(x)]^{N-1}; that the probability of both
these things being true is the product of the probabilities; and, finally,
we need to consider all possible values of x less than T, so we sum with T
being the limit of integration.

For a given item, ability is measured by how different F and G are.
So choose them so as to jointly contain a single parameter characterising
ability, λ, and obtain λ in terms of c and w by eliminating T from the
above equations. Some examples will show how this is done. (Further
implications of this model of performance are described in Hutchinson, 1982.)

2.4 Specific examples

Some choices of F and G give rise to a simple expression for the ability
parameter λ.



Firstly, for 0 < x < 1 and α ≥ 1, let
    F(x) = 1 - x,

    G(x) = \lambda^{-1} (1 - x)^{1/\alpha}    (1 - \lambda^{\alpha} < x < 1),
    G(x) = 1                                  (0 < x < 1 - \lambda^{\alpha}).        (1)

In this case, 1 - λ^α = c - αw/(N-1), so that, since the left-hand side of
the equation is a monotonic function of λ, we have derived the general
linear correction for guessing, each correct answer receiving 1 mark
and each wrong answer receiving -α/(N-1) marks. The conventional formula
is obtained by setting α = 1.
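
The relation between the corrected score and the ability parameter can be
checked by simulation. The sketch below is only an illustration: it assumes
my reading of equations (1), namely F(x) = 1 - x and
G(x) = (1 - x)^(1/alpha) / lam on the range 1 - lam^alpha < x < 1, and the
function name, sample size and parameter values are all hypothetical.

```python
import random


def simulate_model_1(lam, alpha, T, N, n_items=100_000, seed=0):
    """Simulate answering n_items items; return the proportions (c, w, u)."""
    rng = random.Random(seed)
    correct = wrong = 0
    for _ in range(n_items):
        # Correct alternative: F(x) = 1 - x, i.e. uniform on (0, 1).
        x_correct = rng.random()
        # Incorrect alternatives: solve G(x) = U for x, with U uniform on (0, 1),
        # which gives x = 1 - (lam * U) ** alpha.
        x_wrong_min = min(1.0 - (lam * rng.random()) ** alpha
                          for _ in range(N - 1))
        if min(x_correct, x_wrong_min) > T:
            continue  # every mismatch exceeds the threshold: item omitted
        if x_correct < x_wrong_min:
            correct += 1
        else:
            wrong += 1
    c = correct / n_items
    w = wrong / n_items
    return c, w, 1.0 - c - w


lam, alpha, N = 0.6, 1.0, 4
c, w, u = simulate_model_1(lam, alpha, T=0.8, N=N)
corrected_score = c - alpha * w / (N - 1)   # should be close to 1 - lam**alpha
```

With these values the corrected score should settle near 0.4 whatever
threshold T (above 1 - lam^alpha) is used, which is the sense in which the
correction removes the effect of willingness to guess.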



Secondly, suppose that for x > 0,

    F(x) = \exp(-x),    G(x) = \exp(-x/\lambda).                        (2)

Then λ/(N-1) = c/w. If F(x) = 1 - x and G(x) = (1 - x)^{1/λ}, the same
equation is obtained, illustrating that a particular formula for λ
does not imply a unique pair of functions F and G.
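
The non-uniqueness can be illustrated numerically. The following is a rough
sketch rather than a definitive implementation: it evaluates u, c and w by a
simple midpoint rule, assumes that mismatch is non-negative (so integration
starts at 0), and takes the second pair to be F(x) = 1 - x with
G(x) = (1 - x)^(1/lam). The function name and the values of lam, N and T are
hypothetical.

```python
import math


def proportions(F, G, T, N, lower=0.0, steps=20_000):
    """Return (u, c, w) for survival functions F and G, threshold T, N alternatives."""
    u = F(T) * G(T) ** (N - 1)
    h = (T - lower) / steps
    c = 0.0
    for i in range(steps):
        x0 = lower + i * h
        x1 = x0 + h
        # Probability that the correct alternative falls in [x0, x1], times the
        # probability that all N-1 incorrect alternatives exceed the midpoint.
        c += (F(x0) - F(x1)) * G(0.5 * (x0 + x1)) ** (N - 1)
    return u, c, 1.0 - u - c


lam, N = 3.0, 4

# Pair (a): exponential survival functions on x > 0.
u1, c1, w1 = proportions(lambda x: math.exp(-x),
                         lambda x: math.exp(-x / lam), T=1.5, N=N)

# Pair (b): F(x) = 1 - x and G(x) = (1 - x)^(1/lam) on 0 < x < 1.
u2, c2, w2 = proportions(lambda x: 1.0 - x,
                         lambda x: (1.0 - x) ** (1.0 / lam), T=0.7, N=N)

ratio_exponential = c1 / w1   # both ratios should be close to lam / (N - 1)
ratio_power = c2 / w2
```

Both ratios come out at lam/(N-1), even though the two pairs of
distributions, and the thresholds, are quite different.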

3. Performance in an answer-until-correct test

3.1 Introduction 

In answer-until-correct (AUC) tests, the subjects receive immediate
feedback as to whether the selected answer was correct or not. If it was
wrong, they then choose another answer, and continue attempting the item
until the correct answer is found. This practice dates back at least as far
as S L Pressey's work in the 1920's. The principal advantages claimed are:

- The greater information per item provided. Hence higher reliability and, 
perhaps, validity of the test. 

- The feedback is liked by the subjects, and the positive attitude produced
helps to motivate them. 

- Subjects learn from the feedback. 

3.2 Theory 

The probability of the correct alternative having the second-lowest mismatch is

    c_2 = (N-1) \int \left( -\frac{dF(x)}{dx} \right) [1 - G(x)] \, [G(x)]^{N-2} \, dx.

The (conditional) probability of giving the correct answer when the second
choice is made is thus c_2* = c_2/(1 - c).

Five choices of pairs of distributions F and G will be considered further.
The theories arising will be referred to as C, ES, E, N, and L: C for a
theory that is equivalent to the conventional all-or-nothing learning theory,
ES for exponential distributions plus a special state, and E, N, L for
exponential, normal, logistic distributions.

C. The simplest case is when F is a rectangular (uniform) distribution over
the range 0 to 1 and G is a rectangular distribution over the range 1-λ to 1
(i.e. equations (1) with α = 1). Thus a value of X between 0 and 1-λ can only
arise from the correct alternative - the correct answer is known. Over the
range 1-λ to 1 the ratio of the probability densities is a constant - there
is no partial information. c_2* is 1/(N-1) whatever c is.

E. The next most straightforward case is when F and G are both exponential
distributions over the range 0 to ∞, with different exponents (i.e.
equations (2)). There is no state in which the correct answer is known
without error, but a continuous variation of the degree of partial information
over the whole range. In the case of 5-alternative items, the relation between
c and c_2* turns out to be c_2* = 4c/(3 + c).

ES. This case has some features of C and some of E. F is an exponential
distribution over the range 0 to ∞, G is an exponential distribution over
the range λ to ∞. A value of X between 0 and λ can arise only from the
correct alternative - the correct answer is known. For larger X, there is
a continuous variation of partial information. It is possible to express
c as an explicit function of c_2* but not vice versa (see Hutchinson, 1985).

N. F and G are both normal distributions over the range -∞ to ∞, with
different means but the same standard deviation. Again, no special state
corresponding to certain knowledge. An explicit relation between c and
c_2* cannot be obtained, so it is necessary to resort to numerical integration.

L. As N, but with logistic distributions instead of normal distributions. 
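
For theory E, the relation between first-choice and second-choice success can
be checked by simulating the answer-until-correct process directly. This is a
sketch under stated assumptions: the subject is taken to work through the
alternatives in increasing order of mismatch, there is no omission threshold,
and the function name, sample size and the value of lam are hypothetical.

```python
import random


def simulate_auc(sample_correct, sample_incorrect, N, n_items=100_000, seed=0):
    """Return (c, c2_star): first-choice success rate and the success rate at
    the second choice among items whose first choice was wrong."""
    rng = random.Random(seed)
    first = second = 0
    for _ in range(n_items):
        x_correct = sample_correct(rng)
        # Number of incorrect alternatives tried before the correct one.
        tried_before = sum(sample_incorrect(rng) < x_correct
                           for _ in range(N - 1))
        if tried_before == 0:
            first += 1
        elif tried_before == 1:
            second += 1
    return first / n_items, second / (n_items - first)


lam = 2.0
# F(x) = exp(-x) has rate 1; G(x) = exp(-x/lam) has rate 1/lam.
c, c2_star = simulate_auc(lambda rng: rng.expovariate(1.0),
                          lambda rng: rng.expovariate(1.0 / lam), N=5)
predicted_c2_star = 4 * c / (3 + c)
```

Passing samplers for other distributions (for instance `rng.gauss` with two
different means) gives a simulated counterpart to the numerical integration
needed for theory N.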

3.3 Data

The data is that reported by Whetton and Childs (1981). The subjects were
386 school pupils, in the third year of secondary school. The test was
designed to give a measure of spatial reasoning ability. It consisted of
30 items all having the same format; each item presents a flag flying on a
flagpole around which there is a circle. The flag is then shown in a
second picture blowing in a different direction. The subject has to first
judge the direction in which the flag is flying relative to the marked
position on the circle and then work out where he or she would have to
move to on the circle's periphery to see the flag as it is shown in the
second picture. Five alternative positions were given, and the test was
administered in answer-until-correct format.



3.4 Results 

3.4.1 Before resorting to the sophisticated models of 3.2 above, let us
observe that certain simple features of the data demonstrate that some
form of partial knowledge is operating. Firstly: the proportion of items
answered correctly at the second attempt was 0.39 (higher than the chance
level of 0.25); the proportions of items answered correctly when it came to
the third and fourth attempts were 0.42 and 0.56 (higher than the chance
levels of 0.33 and 0.50). Secondly: subjects getting a high proportion
of items correct at first attempt tend also to get a high proportion correct
at second attempt (correlation = 0.68); subjects getting a high proportion
correct within two attempts tend also to get a high proportion correct if
a third attempt is necessary (correlation = 0.42).
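
The chance levels quoted above follow from the number of alternatives still
untried at each attempt. As a small worked check (the function name is merely
illustrative):

```python
N = 5  # alternatives per item in the spatial reasoning test


def chance_level(attempt, n_alternatives=N):
    """Probability that a blind guess is right at the given attempt number."""
    remaining = n_alternatives - (attempt - 1)  # alternatives not yet eliminated
    return 1.0 / remaining


# Observed proportions correct at the second, third and fourth attempts.
observed = {2: 0.39, 3: 0.42, 4: 0.56}
excess = {attempt: p - chance_level(attempt) for attempt, p in observed.items()}
```

Each entry of `excess` is positive, which is the first of the two simple
indications of partial knowledge.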

3.4.2 To compare how well the five theories of 3.2 fit the data, each
subject's responses were condensed to three categories - the number of
items answered correctly first time, the number for which two attempts
were required, and the number for which three or more attempts were required.
Each of the five theories makes a prediction about how these numbers are
inter-related, and a value of chi-squared can be calculated to measure
the degree to which the data departs from the theory. (Each theory
requires one parameter, representing ability, to be fitted to each subject's
data.) When the values of chi-squared were summed over all subjects, the
following results were found:

    C       ES      E       N              L

    1770    1269    490     [illegible]    [illegible]

If a theory were correct, chi-squared would be expected to be 386, since
each subject contributes one degree of freedom. So even the best of the 
theories is not perfect. But more significant is that theory C, implying 
knowledge is all-or-none, performs much worse than the theories incorporating 
partial knowledge. 
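
To make the per-subject calculation concrete, here is a minimal sketch for
theory E with 5 alternatives. It is an illustration only: the fitting method
(a grid search minimising Pearson's chi-squared) is my assumption, since the
method actually used is not described here, and the counts are hypothetical.

```python
def category_probs_E(c):
    """Theory E, 5 alternatives: P(right first time), P(two attempts),
    P(three or more attempts), using c2* = 4c/(3 + c)."""
    p1 = c
    p2 = (1.0 - c) * 4.0 * c / (3.0 + c)
    return p1, p2, 1.0 - p1 - p2


def chi_squared(counts, probs):
    """Pearson chi-squared between observed counts and expected n * p."""
    n = sum(counts)
    return sum((obs - n * p) ** 2 / (n * p) for obs, p in zip(counts, probs))


def fit_theory_E(counts, grid=1000):
    """Grid search over c in (0, 1); returns (minimum chi-squared, fitted c)."""
    candidates = (k / grid for k in range(1, grid))
    return min((chi_squared(counts, category_probs_E(c)), c) for c in candidates)


counts = (7, 4, 3)  # hypothetical subject: first time, two attempts, three or more
chi2, c_hat = fit_theory_E(counts)
```

Three categories with one fitted parameter leave one degree of freedom per
subject, which is why the summed chi-squared is compared with 386.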

3.4.3 The items varied in difficulty whereas the theories require the
estimation of an ability parameter that is the same for all of a number of
items for a given subject. Therefore, the analysis was repeated with the 
test split into three subsets of items, within each of which there was 
less variation of item difficulty than in the test as a whole. The 
difficult set consisted of 11 items for which the proportion of subjects 
answering them correctly at first attempt was between 0.26 and 0.35. The
medium set consisted of 11 items for which this proportion was between
0.36 and 0.42. The easy set consisted of 8 items for which this proportion
was between 0.45 and 0.71. The results were not appreciably different from
those for the whole test.

4. Attempting nonsense items

4.1 Introduction

One means sometimes used to gain more information about the processes
operating in multiple-choice tests is to include among the items some which
have no correct answer among the alternatives available. This has generally
been done for research reasons, not educational ones, though Granich (1931)
did suggest announcing their inclusion as a deterrent to random guessing.
Inclusion of nonsense items dates back more than 50 years (English, 1928;
Thelin and Scott, 1928), but perhaps the largest body of work on this
subject is by Slakter and colleagues. He uses the term "risk taking on
objective examinations" (rtooe) to refer to the propensity to attempt
nonsense items and to Ziller's index (see below) for legitimate items.
Slakter (1969) reported the administration to 636 subjects of 4 tests
(language aptitude, mathematics aptitude, language achievement, mathematics
achievement). These each included 10 nonsense questions embedded in 30 or
40 legitimate questions. Measures of rtooe were calculated from the
nonsense items (proportion attempted) and from the legitimate items (by
Ziller's method). Slakter found (i) these two measures positively correlated
and (ii) rtooe appeared to be a general trait, in the sense that there was a
positive correlation between different tests. From this and other studies
he concludes that rtooe is a feature of personality, and related to
dominance-submission, maladjustment, vocational choice, curriculum choice,
and perception of risk in military situations.

4.2 Theory 

Following the approach of Section 2, we assume that the probability
distribution of mismatch for the alternatives given for the nonsense
items is the same as that for the incorrect alternatives in the genuine
items. Then the probability of leaving a nonsense item unanswered is
[G(T)]^N, in which case the probability of giving an answer (denoted a) is

    a = 1 - [G(T)]^{N}.



If equations (1) hold, then in the special case α = 1,

    a = \frac{N w}{(N-1) u + N w}                                        (3)

(it has been assumed that all items are sufficiently difficult for all
subjects to have a non-zero probability of giving a wrong answer), or,
for general α,

    a = 1 - \left[ \frac{(N-1) u}{(N-1) u + (N-1+\alpha) w} \right]^{N/(N-1+\alpha)}.    (4)

In the case α = 1, equations (1) are equivalent to the conventional model
that each subject knows the answer with probability p (specific to the subject,
reflecting his or her ability) and decides to guess with probability g if he
or she does not, in which case the probability of being correct is 1/N. Then
for nonsense items, the probability of response will presumably be g; it turns
out that g is given by (3) (Ziller, 1957).
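
That the guessing probability is recovered from u and w can be verified in a
few lines. The expressions for c, w and u below are my own unpacking of the
conventional model as just described (know with probability p, otherwise
guess with probability g and be right with probability 1/N), and the function
names are hypothetical.

```python
def conventional_model(p, g, N):
    """Proportions (c, w, u) under the all-or-none knowledge model."""
    c = p + (1 - p) * g / N            # known, or guessed and lucky
    w = (1 - p) * g * (N - 1) / N      # guessed and unlucky
    u = (1 - p) * (1 - g)              # not known and not guessed
    return c, w, u


def guessing_probability(u, w, N):
    """Willingness to guess implied by u and w: N w / ((N-1) u + N w)."""
    return N * w / ((N - 1) * u + N * w)


c, w, u = conventional_model(p=0.3, g=0.6, N=4)
recovered_g = guessing_probability(u, w, N=4)   # should reproduce g = 0.6
```

The recovered value does not depend on p, so the same expression can be
applied to subjects of any ability.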

Turning now to the case where equations (2) hold, a little algebra shows that

    a = 1 - u^{N w / [(N-1)(1-u)]}.                                      (5)
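
The predictions can be written as short functions of the observed u and w.
This is a sketch, not a definitive statement of the formulae: the expression
used for general α is my own working from the uniform-type distributions with
exponent 1/α, and is included on the assumption that it matches the intended
formula (4); the function names and the example values are hypothetical.

```python
def predict_formula_3(u, w, N):
    """Conventional (alpha = 1) prediction of the proportion attempted."""
    return N * w / ((N - 1) * u + N * w)


def predict_formula_4(u, w, N, alpha):
    """General-alpha prediction (assumed form; reduces to formula 3 at alpha = 1)."""
    k = N - 1 + alpha
    return 1.0 - ((N - 1) * u / ((N - 1) * u + k * w)) ** (N / k)


def predict_formula_5(u, w, N):
    """Prediction under exponential distributions; requires 0 <= u < 1."""
    return 1.0 - u ** (N * w / ((N - 1) * (1.0 - u)))


u, w, N = 0.1, 0.3, 4            # hypothetical subject on a 4-alternative test
a3 = predict_formula_3(u, w, N)
a4 = predict_formula_4(u, w, N, alpha=1.0)
a5 = predict_formula_5(u, w, N)
```

For this hypothetical subject the exponential prediction is the lower of the
two, which is at least in the right direction given that both predictions
were found to be too high.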



4.3 Data 

Cross and Frary (1977) report the administration of a 4-alternative 20-item
test of chemistry to 407 subjects. As well as the 20 genuine items, there
were 4 nonsense items included. The directions to the subjects were designed
to encourage informed guessing but discourage wild guessing: "Your score
will be the number of items you mark correctly minus a fraction of the number
you mark incorrectly. You should answer questions even when you are not sure
your answers are correct. This is especially true if you can eliminate one
or more choices as incorrect or have a hunch or feeling about which choice is
correct. However, it is better to omit an item than to guess wildly among all
of the choices given."

4.4 Results 

Because each subject was exposed to only 4 nonsense items, the following
procedure was adopted. The subjects were grouped into ranges according to
their value of expression (3). Then the mean proportion of nonsense items
which were answered by the subjects in each group was found for comparison.
Finally, the process was repeated with subjects being grouped according to
their value of expression (5), rather than (3).
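
The grouping procedure might be sketched as follows. The number and width of
the ranges are assumptions (equal-width ranges are used here purely for
illustration), and the data are hypothetical.

```python
from collections import defaultdict


def grouped_means(predicted, observed, n_groups=5):
    """Group subjects by predicted proportion; return, for each group,
    (mean predicted, mean observed, number of subjects)."""
    groups = defaultdict(list)
    for p, o in zip(predicted, observed):
        index = min(int(p * n_groups), n_groups - 1)  # p == 1.0 joins the top group
        groups[index].append((p, o))
    summary = {}
    for index in sorted(groups):
        members = groups[index]
        n = len(members)
        summary[index] = (sum(p for p, _ in members) / n,
                          sum(o for _, o in members) / n,
                          n)
    return summary


predicted = [0.10, 0.15, 0.55, 0.90, 1.00]   # hypothetical predicted proportions
observed = [0.00, 0.25, 0.50, 0.75, 1.00]    # proportion of 4 nonsense items answered
summary = grouped_means(predicted, observed)
```

Averaging within groups smooths out the coarseness of an observed proportion
that can take only five values for any one subject.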

When this was done, it was found that (i) both variants of this theory had
some success, in that there was a moderate correlation between the prediction
and the findings, (ii) both tended to overestimate a, and (iii) formula (5)
appeared to be slightly better than formula (3).

Also calculated, this on a subject-by-subject basis, was the correlation
between the actual proportions of nonsense items answered (which could only
take the values 0, ¼, ½, ¾, 1) and the (two) predicted proportions. This was
found to be [illegible] in the case of expression (3) and [illegible] in the
case of expression (5). Trying different values of α in (4), a maximum
correlation of [illegible] was obtainable: this occurred when α = 0. (Though
the theories say that a should take particular values for given u and w, not
merely that a should be correlated with them, the possibility that the
distractors for the nonsense items may not have been of the same
attractiveness as the distractors for the genuine items suggests we should be
interested in how high the correlations are, as well as in how small are the
differences between predictions and results.)

5. Conclusion

For over 50 years, the overwhelming weight of evidence has been that 
subjects are able to make use of partial information when responding to 
multiple-choice items. There has, however, sometimes been a question as to 
whether they should be permitted to benefit from this. One school of 
thought says we are trying to estimate the number of test items that the
subject knows; and consequently if we have evidence that he or she is getting
many right because of partial information, we should make a large deduction 
from the number he or she gets right in order to obtain the number known; we 
are therefore penalising partial information. The alternative line of reasoning 
is that all information is useful, even when it is not complete; that the 
distinction between full and incomplete information is either not valid or 
merely a matter of degree; that the subject should be credited for the partial
information he or she has. The importance in real life of having to act on 
incomplete information and make intelligent guesses is adduced in support of 
this. This dichotomy of opinions has rarely been given explicit attention, 
though I think it is what Moy and Chou (1982) are getting at in their first 
paragraph on page five. I take the second view, that partial information is 
valuable and should be credited, and the structure of the theory of Section 2 
reflects this. It is, however, an assumption, and ultimately depends on 
intuitive notions about the relation between performance on tests and in the 
real world, and on validity studies of this relation. 

I believe I have shown it is practicable (i) to use variant formats of 
multiple-choice tests to compare different quantitative descriptions of partial
knowledge, and (ii) to allow a subject's partial knowledge to contribute to
the estimate of his or her ability. There must be many datasets that are 
potentially suitable for comparing theories; I would be very interested to 
hear from anyone wanting to collaborate in their analysis. 